{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Groupby_agg : Shortcut for assigning a groupby-transform to a new column."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Background"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook serves to show how to use the `groupby_agg` method from pyjanitor's general functions submodule.\n",
    "\n",
    "The `groupby_agg` method allows us to add the result of an aggregation from a grouping, as a new column, back to the dataframe.\n",
    "\n",
    "Currently in pandas, to append a column back to a dataframe, you do it in three steps:\n",
    "1. Groupby a column or columns\n",
    "2. Apply the `transform` method with an aggregate function on the grouping, and finally\n",
    "3. Assign the result of the transform to a new column in the dataframe.\n",
    "\n",
    "In pseudo-code, this might look something like:\n",
    "```python\n",
    "df = df.assign(\n",
    "    new_column_name=df.groupby(...)[...].transform(...)\n",
    ")\n",
    "```\n",
    "\n",
    "The `groupby_agg` method allows you to achieve the same result in a single function call and with sensible arguments. The example below illustrates the use of this function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load modules\n",
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Examples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Basic example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We start off with a simple example.\n",
    "Given a `df` as defined below, we wish to use `groupby_agg` to find the average price for each item, and join the results back to the original dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"item\": [\"shoe\", \"shoe\", \"bag\", \"shoe\", \"bag\"],\n",
    "        \"MRP\": [220, 450, 320, 200, 305],\n",
    "        \"number_sold\": [100, 40, 56, 38, 25],\n",
    "    }\n",
    ")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the output of `groupby_agg` contains the same number of rows as the input dataframe, i.e., the operation here is a groupby + transform.\n",
    "\n",
    "Here, `by` is the name(s) of the column(s) being grouped over. `agg` is the aggregate function (e.g. sum, mean, count...), which is beinng applied to the data in the column specified by `agg_column_name`.\n",
    "Finally, `new_column_name` is the name of the newly-added column containing the transformed values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.groupby_agg(\n",
    "    by=\"item\",\n",
    "    agg=\"mean\",\n",
    "    agg_column_name=\"MRP\",\n",
    "    new_column_name=\"Avg_MRP\",\n",
    ")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Specifying multiple columns to group over"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The basic example shown above specified a single column in `by` to group over.\n",
    "Grouping over multiple columns is also supported in general, since `groupby_agg` is just using the standard pandas `DataFrame.groupby` method under the hood.\n",
    "\n",
    "An example is shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"date\": pd.date_range(\"2021-01-12\", periods=5, freq=\"W\"),\n",
    "        \"item\": [\"sneaker\", \"boots\", \"sneaker\", \"bag\", \"bag\"],\n",
    "        \"MRP\": [230, 450, 300, 200, 305],\n",
    "    }\n",
    ")\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.groupby_agg(\n",
    "    by=[\"item\", df[\"date\"].dt.month],\n",
    "    agg=\"mean\",\n",
    "    agg_column_name=\"MRP\",\n",
    "    new_column_name=\"Avg_MRP_by_item_month\",\n",
    ")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The `dropna` parameter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the column(s) being grouped over (`by`) contains null values, you can include the null values as its own individual group, by passing `False` to `dropna`. Otherwise, the default behaviour is to `dropna=True`, in which case, the corresponding transformed values (in `new_column_name`) will be left as NaN.\n",
    "This feature was introduced in Pandas 1.1.\n",
    "\n",
    "You may read more about this parameter in the [Pandas user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#id2)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"name\": (\"black\", \"black\", \"black\", \"red\", \"red\"),\n",
    "        \"type\": (\"chair\", \"chair\", \"sofa\", \"sofa\", \"plate\"),\n",
    "        \"num\": (4, 5, 12, 4, 3),\n",
    "        \"nulls\": (1, 1, np.nan, np.nan, 3),\n",
    "    }\n",
    ")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's get the value counts of the values in the `nulls` column.\n",
    "Compare the two outputs from the following cell when `dropna` is set to True and False respectively:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"With dropna=True (default)\")\n",
    "filtered_df = df.groupby_agg(\n",
    "    by=[\"nulls\"],\n",
    "    agg=\"size\",\n",
    "    agg_column_name=\"type\",\n",
    "    new_column_name=\"counter\",\n",
    "    dropna=True,\n",
    ")\n",
    "display(filtered_df)\n",
    "\n",
    "print(\"With dropna=False\")\n",
    "filtered_df = df.groupby_agg(\n",
    "    by=[\"nulls\"],\n",
    "    agg=\"size\",\n",
    "    agg_column_name=\"type\",\n",
    "    new_column_name=\"counter\",\n",
    "    dropna=False,\n",
    ")\n",
    "display(filtered_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Method chaining"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `groupby_agg` method can be extended for different purposes. One of these is groupwise filtering, where only groups that meet a condition are retained.\n",
    "Let's explore this with an example, reusing one of the small dataframe from before:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"name\": (\"black\", \"black\", \"black\", \"red\", \"red\"),\n",
    "        \"type\": (\"chair\", \"chair\", \"sofa\", \"sofa\", \"plate\"),\n",
    "        \"num\": (4, 5, 12, 4, 3),\n",
    "        \"nulls\": (1, 1, np.nan, np.nan, 3),\n",
    "    }\n",
    ")\n",
    "\n",
    "filtered_df = df.groupby_agg(\n",
    "    by=[\"name\", \"type\"],\n",
    "    agg=\"size\",\n",
    "    agg_column_name=\"type\",\n",
    "    new_column_name=\"counter\",\n",
    ").query(\"counter > 1\")\n",
    "filtered_df"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
